Installing Dependencies

if (!requireNamespace("BiocManager", quietly = TRUE)){
  install.packages("BiocManager")}
if (!requireNamespace("edgeR", quietly = TRUE)){
  BiocManager::install("edgeR")}
if (!requireNamespace("GEOquery", quietly = TRUE)){
  BiocManager::install("GEOquery")}
if (!requireNamespace("ComplexHeatmap", quietly = TRUE)){
  BiocManager::install("ComplexHeatmap")}
if (!requireNamespace("limma", quietly = TRUE)){
  BiocManager::install("limma")}
if (!requireNamespace("ExpressionSet", quietly = TRUE)){
  BiocManager::install("ExpressionSet")}
if  (!requireNamespace("circlize", quietly = TRUE)){
  BiocManager::install("circlize")}
if (!requireNamespace("gprofiler2", quietly = TRUE))
    BiocManager::install("gprofiler2")
if (!requireNamespace("kableExtra", quietly = TRUE))
    BiocManager::install("kableExtra")
if (!requireNamespace("plotly", quietly = TRUE))
    BiocManager::install("plotly")
if (!requireNamespace("dplyr", quietly = TRUE))
    BiocManager::install("dplyr")
library(dplyr)
library(plotly)

Part 1: Loading the normalized gene data

The normalized data has already been written into a text file by the previous Assignment 1. For better assessment, I have indicated it on the file uploaded.

normalized_count_data <- read.table(file=file.path(getwd(), "normalized_counts_annotated")
, header = TRUE, stringsAsFactors = FALSE, check.names = FALSE )

knitr::kable(normalized_count_data[1:10,1:7], type="pipe")
hgnc_symbol YRE1A YRE1B YRE2A YRG1A YRG1B YRG2B
A4GALT 4.345648 3.111477 2.408097 5.369569 6.397003 6.796014
AAAS 19.317681 22.327817 21.792253 18.384756 17.419577 17.311220
AACS 32.913059 24.534108 24.466495 23.731644 17.621289 17.428989
AADACP1 2.418554 4.507231 3.916687 13.734990 14.270060 14.143042
AADAT 4.058510 3.995899 3.647402 4.637361 4.052613 3.678086
AAED1 14.351471 11.913866 12.242066 12.711562 8.425186 7.790737
AAGAB 36.922469 35.816951 39.338927 40.358329 31.836730 30.961949
AAK1 1.531217 1.364910 1.534162 2.165459 2.677295 2.667404
AAMDC 46.807715 44.246410 48.924065 66.090858 49.702345 44.178113
AAMP 82.175802 88.352333 93.832788 76.203567 81.078221 82.443384

Part 2: Defining Portions of the Heatmap Matrix

I am defining the elements of the normalized data matrix which I will be using for the heatmap later.

Part 3: Creating Heatmap

The heatmap uses different colors for the expression data. The heatmap has many genes clustered based on their expression at different levels such as control (EA1,EA2,EB1) or the cells treated with GLI2 (G1A,G2A,G1B).

if(min(heatmap_matrix) == 0){
   heatmap_col = circlize::colorRamp2(c( 0, max(heatmap_matrix)),
                           c("white", "purple"))
} else {
  heatmap_col = circlize::colorRamp2(c(min(heatmap_matrix), 0,
         max(heatmap_matrix)), c("darkgreen", "white", "purple"))
}
first_heatmap <- ComplexHeatmap::Heatmap(as.matrix(heatmap_matrix),
     show_row_dend = TRUE, show_column_dend = TRUE,
     col=heatmap_col, show_column_names = TRUE,
     show_row_names = FALSE, show_heatmap_legend = TRUE)
first_heatmap

### Part 4: SPP1 EMT Genes-Conclusion justification
SPP1 is an EMT genes It shows higher expression when treated with GLI2 compare to empty vector control cells (EV). This has been also proved by the authors. However, indicating this again would strengthen the hypothesis corresponding to Dox treated cells having GLI2 vector show higher EMT gene expression as opposed to controls having empty vector showing less expression of the genes.

GLI2_treatment_samples <- grep(colnames(normalized_count_data),pattern = "\\G")
EV_tratment_samples <- grep(colnames(normalized_count_data),pattern = "\\E")
gene_of_interest <- which(normalized_count_data$Gene_symbol == "SPP1")

YAPC_GLI2_samples <- t(normalized_count_data[gene_of_interest,5:7])
colnames(YAPC_GLI2_samples) <- c("GLI2_vector_cells")

YAPC_EV_samples <- t(normalized_count_data[gene_of_interest,2:4])
colnames(YAPC_EV_samples) <- c("Empty_Vector_cells")

YAPC_GLI2_samples
##       GLI2_vector_cells
## YRG1A          321.9614
## YRG1B          607.5096
## YRG2B          580.2982
YAPC_EV_samples
##       Empty_Vector_cells
## YRE1A           1.307170
## YRE1B           1.616595
## YRE2A           1.752119

Part 5: T-test between EV and GLI2

GLI2 treated cells are different in terms of their expression from the empty vector cells. P value (0.03135) < 0.05. Indeed this shows there is significant difference. However, this requires to see if GLI2 treatments show difference among one another so the control cells.

t.test(x=t(YAPC_GLI2_samples), y=t(YAPC_EV_samples))
## 
##  Welch Two Sample t-test
## 
## data:  t(YAPC_GLI2_samples) and t(YAPC_EV_samples)
## t = 5.5139, df = 2, p-value = 0.03135
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  110.2125 893.1831
## sample estimates:
##  mean of x  mean of y 
## 503.256412   1.558628

PART 6: MDS Plot for Sample Clustering

The MDS plot allowed to visualize the difference within the different cell types Earlier examples of MDS plot showed similar results. I hypothesize that GLI2 expression is higher than control cell’s expression levels due to DOX induction of EMT genes. In general I have observed Gli2 cells to be less clustered, due to higher differential expression among GLI2 cells.

limma::plotMDS(heatmap_matrix,col=c(rep("darkgreen",3), rep("blue",3)))

### Part 7: Model Building of GLI2 expression status The model I am building is based upon GLI2 expression status of genes that are showing higher expression with significant indications of the differential expression.

samples <- data.frame(lapply(colnames(normalized_count_data)[2:7],
           FUN=function(x){unlist(strsplit(x, split = "[0-9]"))[c(1)]}))

colnames(samples) <- colnames(colnames(normalized_count_data)[2:7])
rownames(samples) <- c("Treatments")
samples <- data.frame(t(samples))
knitr::kable(samples, type = "pipe") 
Treatments
YRE
YRE
YRE
YRG
YRG
YRG
YRG_model <- model.matrix( ~ samples$Treatments)
knitr::kable(YRG_model[1:6,], type = "pipe")
(Intercept) samples$TreatmentsYRG
1 0
1 0
1 0
1 1
1 1
1 1

Part 8: Model Fitting and Differential Expression Calculation

The differential expression data matrix corresponding to gene symbols were merged with the matrix that has the fitted model. The Benjamin-hochberg correction method was applied to adjust the expression data.

expressionMatrix <- as.matrix(normalized_count_data[,2:7])
rownames(expressionMatrix) <- normalized_count_data$Gene_symbol
colnames(expressionMatrix) <- colnames(normalized_count_data)[2:7]
set_expression_matrix <- Biobase::ExpressionSet(assayData = expressionMatrix)

fit <- limma::lmFit(set_expression_matrix, YRG_model)

fit_Bayes <- limma::eBayes(fit, trend = TRUE)

BH_method_fit <- limma::topTable(fit_Bayes,
                   coef = ncol(YRG_model),
                   adjust.method = "BH",
                   number = nrow(expressionMatrix))

merged_hits <- merge(normalized_count_data$Gene_symbol,
                     BH_method_fit,
                     by.y=0,by.x=1,
                     all.y=TRUE)

merged_hits <- merged_hits[order(merged_hits$P.Value),]
colnames(merged_hits) <- c("Gene_symbol","logFC", "AveExpr", "t",
                           "P.Value","adj.P.Val", "B")
rownames(merged_hits) <- 1:nrow(merged_hits)

knitr::kable(merged_hits[1:10,1:7], type = "pipe", row.names = FALSE,)
Gene_symbol logFC AveExpr t P.Value adj.P.Val B
S100A2 192.67873 105.164286 63.79569 0e+00 0.0000245 1.820600
SSFA2 76.45805 75.591839 45.17567 0e+00 0.0000789 1.781362
NT5E 86.27685 123.407858 42.93449 0e+00 0.0000789 1.773004
S100A13 132.80964 179.285328 37.98876 0e+00 0.0001191 1.749192
TRIB2 51.68968 38.062060 35.69100 1e-07 0.0001361 1.734718
HEG1 -13.44713 8.317468 -31.24508 1e-07 0.0002423 1.697467
NTN4 35.83705 33.048568 28.42566 2e-07 0.0003561 1.664696
UCP2 80.78518 94.331418 26.66241 3e-07 0.0004488 1.638993
PLLP -31.42241 35.188668 -25.42904 4e-07 0.0004683 1.617932
UNC5B 19.59576 12.041479 24.78963 5e-07 0.0004683 1.605831

Part 9: P-value Assessment

The p-value was adjusted to have more stringent results. Normally significant p-value is set to 0.05. However, I wanted to be more stringent on my results and indicate the specific EMT genes. Probable genes are 741. However, there are 27 genes that have significant potential to be an EMT gene.

length(which(merged_hits$P.Value < 0.0005))
## [1] 741
length(which(merged_hits$adj.P.Val < 0.0005))
## [1] 27

Part 10: Model Visualization

This part of this assignment aims to detect SPP1’s role on how GLI2 cells are more differently expressed and do have higher difference than the control cells. It is evident that, with more stringent p-value, the significant portion (741) of about 11500 genes are showing features of EMT genes

model_pvalues <- data.frame(Gene_symbol = merged_hits$Gene_symbol, 
pvalue = merged_hits$P.Value)

model_pvalues$color <- "black"
model_pvalues$color[model_pvalues$pvalue < 0.0005] <- "orange"
    
SPP1 <- normalized_count_data$Gene_symbol[which(normalized_count_data$Gene_symbol == "SPP1")]
model_pvalues$color[model_pvalues$Gene_symbol== SPP1] <- "red"

plot(model_pvalues$pvalue,
     col = model_pvalues$color,
     xlab = "Number of Genes within the Dataset",
     ylab ="P-value",
     main="P value distribution of EMT Genes")

points(model_pvalues[which(model_pvalues$Gene_symbol == "SPP1"), 1:2],
       pch = 20, col="red", cex=1.5)
legend(0, 1, legend=c("SPP1"), fill=c("red"))

### Part 11: Clustered Heatmap with More Stringent Values With more stringent gene size I was able to see cleaner difference between GLI2 and the control cells. The clustered genes are the symmetric each other across the diagonals. The high expression showing genes on the GLI2’s side is lower on the EV’ side. Top hits are assessed on the stringent p-value of 0.0005. The difference between the first and the latest heatmap shows there is significant difference in terms of gene expression whose expression is represented for each respective Treatments

top_hits <- merged_hits$Gene_symbol[merged_hits$P.Value < 0.0005]
heatmap_matrix_tophits <- t(scale(t(heatmap_matrix[which(rownames(heatmap_matrix) 
                       %in% top_hits),])))

if(min(heatmap_matrix_tophits) == 0){
heatmap_col = circlize::colorRamp2(c( 0, max(heatmap_matrix_tophits)), 
                                   c( "white", "purple"))
} else{
heatmap_col = circlize::colorRamp2(c(min(heatmap_matrix_tophits), 0,
     max(heatmap_matrix_tophits)), c("darkgreen", "white", "purple"))
}
latest_heatmap <- ComplexHeatmap::Heatmap(as.matrix(heatmap_matrix_tophits),
show_row_dend = TRUE, show_column_dend = TRUE, col=heatmap_col,
show_column_names = TRUE, show_row_names = FALSE, show_heatmap_legend = TRUE)
latest_heatmap

first_heatmap

### Part 12: Dataset Filtering and Data Extracting The previously extracted GSE131222 data from gene omnibus expression (GEO) is extracted separately. The low count portions of the data filtered and a new matrix was created.

data = GEOquery::getGEOSuppFiles("GSE131222")
datanames = rownames(data)

GLI2_exp =read.delim(datanames[1],header=TRUE,check.names = FALSE)

count_per_ms = edgeR::cpm(GLI2_exp[,3:8])
rownames(count_per_ms) <- GLI2_exp[,2]

rows_filtered = rowSums(count_per_ms >1) >= 6
GLI2_exp_filtered = GLI2_exp[rows_filtered,]

filtered_GLI2_matrix <- as.matrix(GLI2_exp_filtered[,3:8])
rownames(filtered_GLI2_matrix) <- GLI2_exp_filtered$gene
head(GLI2_exp_filtered)[1:8]
##         gene                locus     YRE1A     YRE1B     YRE2A     YRG1A
## 7  LINC01128   chr1:762970-794826   3.65922   3.95950   4.17393   3.89816
## 9     KLHL17   chr1:895966-901099   8.94838  10.89430  11.20550   6.99464
## 10   PLEKHN1   chr1:901876-910484   4.90645   5.28867   5.02146   4.31839
## 11     ISG15   chr1:948846-949919 322.87300 251.30200 287.47100 328.22400
## 12      AGRN   chr1:955502-991499 198.02100 221.32500 221.63500 163.25300
## 18   B3GALT6 chr1:1167628-1170420  10.17300  12.26130  12.14880  11.34150
##        YRG1B     YRG2B
## 7    3.33866   3.57704
## 9   10.45200  10.27520
## 10   4.29991   4.14435
## 11 562.14200 586.97600
## 12 214.04600 219.62400
## 18  12.56020  11.78190

Part 13: Grouping Based on the Cell Types

The similar grouping was done previously. However this time dispersion values, normalization factors and differential expression model fitting is applied to the samples of GLI2 treatments.

grouped_treatments <- data.frame(lapply(colnames(GLI2_exp_filtered)[3:8],
                FUN=function(x){unlist(strsplit(x, split = "[0-9]"))[c(1)]}))
colnames(grouped_treatments) <- colnames(GLI2_exp)[3:8]
rownames(grouped_treatments) <- c("cell_type")
grouped_treatments_normalized <- data.frame(t(grouped_treatments))
d = edgeR::DGEList(counts=filtered_GLI2_matrix, group = grouped_treatments_normalized$cell_type)

model_design_celltypes <- model.matrix(~grouped_treatments_normalized$cell_type+0)
d <- edgeR::estimateDisp(d, model_design_celltypes)
d <- edgeR::calcNormFactors(d)
fit <- edgeR::glmQLFit(d, model_design_celltypes)
qlf_test.GLI2vsEV <- edgeR::glmQLFTest(fit, coef = "grouped_treatments_normalized$cell_typeYRG")

Part 14: Significant Results within the Fitted Model

With more stringent p-value I was able to assess the critical threshold for the expression threshold. I set the p-value as 0.0005. I decided to put it by adjusting and observing any difference. The difference between p-value 0.005 and p-value 0.0005 is one. This means this p-value is safe to assume as a threshold for significance among these differentially expressing genes.

qlf_top_hits <- edgeR::topTags(qlf_test.GLI2vsEV,sort.by = "PValue", n = nrow(filtered_GLI2_matrix))

head(qlf_top_hits)
## Coefficient:  grouped_treatments_normalized$cell_typeYRG 
##            logFC   logCPM         F       PValue         FDR
## ADAM15 -14.66671 5.427443 1252636.1 8.490905e-20 3.69538e-16
## LAMB1  -14.63332 5.416059 1201380.1 9.840550e-20 3.69538e-16
## RRAGA  -14.43536 5.426032 1033729.1 1.672979e-19 3.69538e-16
## SMAP2  -14.84240 5.427397  999721.4 1.882720e-19 3.69538e-16
## DDX18  -14.60997 5.420251  920508.0 2.519831e-19 3.69538e-16
## CASC4  -14.51764 5.430442  856813.2 3.245837e-19 3.69538e-16
length(which(qlf_top_hits$table$PValue < 0.0005))
## [1] 2434
length(which(qlf_top_hits$table$FDR < 0.0005))
## [1] 2433
length(which(qlf_top_hits$table$logFC < 0))
## [1] 11527

Part 15: Thresholded Gene List and Final EMT Gene Detection

The threshold was calculated and based its value I assessed the how many genes actually show traits similar to GLI2. I used P value 1 to eliminate significant portion of the differentially expressed genes. I did not use the LogFC to distinguish between EMT GLI2 vs non-EMT GLI2, because all of logFC values are negative. This indicates there is always up regulation after the Dox treatment and GLI2 expression is always higher than the control. Pretty high confidence.

qlf_tophits_withgn <- merge(GLI2_exp[,1:1], qlf_top_hits, by.y=0,by.x=1,all.y=TRUE)

colnames(qlf_tophits_withgn) <- c("Gene_symbol", "logFC", "logCPM","F", "PValue", "FDR")

qlf_tophits_withgn[,"logarithmic"] <- (-log(qlf_tophits_withgn$PValue))
qlf_tophits_withgn <- qlf_tophits_withgn[order(qlf_tophits_withgn$Gene_symbol),]

GLI2_emt_Genes <- qlf_tophits_withgn$Gene_symbol[which(qlf_tophits_withgn$PValue < 0.0005
                & qlf_tophits_withgn$logFC < 0)]
GLI2_non_emtgenes <- qlf_tophits_withgn$Gene_symbol[which(qlf_tophits_withgn$PValue == 1)]

write.table(x=GLI2_emt_Genes, file = file.path(getwd(),"data","GLI2_emt_Genes.txt"),
             sep = "\t", row.names = FALSE, col.names = FALSE, quote = FALSE)
write.table(x=GLI2_non_emtgenes, file = file.path(getwd(),"data","GLI2_non_emtgenes.txt"),
                   sep = "\t", row.names = FALSE, col.names = FALSE, quote = FALSE)
genename <- qlf_tophits_withgn$Gene_symbol[18:nrow(qlf_tophits_withgn)]
genename <- unlist(strsplit(genename, split = ","))
genename <- data.frame(genename)
qlf_tophits_withgn <- merge(genename$genename,qlf_tophits_withgn[,7], by.y=0,by.x=0,all.y=TRUE)
qlf_tophits_withgn <- qlf_tophits_withgn[,2:3]
colnames(qlf_tophits_withgn)[1] <- "Genename"
colnames(qlf_tophits_withgn)[2] <- "Logarithmic"
write.table(x=qlf_tophits_withgn,file=file.path("data","logarithmic_genelist.rnk"),
    sep = "\t",append=FALSE,row.names = FALSE,col.names = FALSE,quote = FALSE)

Part 16: GProfiler for Upregulation and Downregulation pathway analysis

The GProfiler analysis includes BH correction methods (FDR). I sticked using the same method because I used the same type of method on my previous analysis on FDR portion. For both upregualted and the downregulated genes I have used GO:BP (2022-12-04), Reactome, and WikiPathways( for annotation. These annotations allows for better asessment of my Upregualted genes in terms of defining the biologic pathways they are invovled in.

Upregulated_genes <- read.table(file=file.path(getwd(),"data" ,"GLI2_emt_Genes.txt"),header = TRUE, stringsAsFactors = FALSE, check.names = FALSE)
Upregulated_genes <- Upregulated_genes[4:nrow(Upregulated_genes),]
Upregulated_Gprofiler_validated <- gprofiler2::gost(query = Upregulated_genes, 
                                  organism = "hsapiens", 
                                  exclude_iea = TRUE,
                                  sources = c("GO:BP", "REAC", "WP"),
                                  correction_method = "fdr",
                                  ordered_quer = FALSE,
                                  )

Continuing Analysis

Downregulated_genes <- read.table(file=file.path(getwd(),"data" ,"GLI2_non_emtgenes.txt"),header = TRUE, stringsAsFactors = FALSE, check.names = FALSE)
Downregulated_genes <- Downregulated_genes[13:nrow(Downregulated_genes),]
Downregulated_Gprofiler_validated <- gprofiler2::gost(query = Downregulated_genes,
                                  organism = "hsapiens", 
                                  exclude_iea = TRUE,
                                  correction_method = "fdr",
                                  ordered_quer = FALSE,
                                  source = c("GO:BP", "REAC", "WP"))

Part 17: Subsetting the Gprofiler downregulated and upregualted genes

After subsetting I do have 771 genes that are down regulated. Down regualted genes are invovled in common biological pathways that are not involved in cancerous activities.

Downregulated_genes_filtered <- data.frame(
  term_name = Downregulated_Gprofiler_validated$result$term_name[Downregulated_Gprofiler_validated$result$term_size < 200 &                                                                         Downregulated_Gprofiler_validated$result$term_size > 1],
   term_id  = Downregulated_Gprofiler_validated$result$term_id[Downregulated_Gprofiler_validated$result$term_size < 200 & 
              Downregulated_Gprofiler_validated$result$term_size > 1],
  source  = Downregulated_Gprofiler_validated$result$source[Downregulated_Gprofiler_validated$result$term_size < 200 &
              Downregulated_Gprofiler_validated$result$term_size > 1]
)
length(Downregulated_genes_filtered$term_name)
## [1] 771
knitr::kable(Downregulated_genes_filtered$term_name[1:50], format = "html")
x
tRNA metabolic process
cell cycle checkpoint signaling
DNA-templated transcription elongation
tRNA processing
regulation of DNA-templated transcription elongation
DNA-templated DNA replication
DNA integrity checkpoint signaling
tRNA modification
regulation of transcription elongation by RNA polymerase II
transcription elongation by RNA polymerase II
RNA modification
negative regulation of mitotic cell cycle
DNA damage checkpoint signaling
phosphatidylinositol biosynthetic process
signal transduction in response to DNA damage
mitotic cell cycle checkpoint signaling
protein acetylation
peptidyl-lysine acetylation
glycerophospholipid biosynthetic process
phosphatidylinositol metabolic process
negative regulation of mitotic cell cycle phase transition
histone acetylation
positive regulation of transcription elongation by RNA polymerase II
rRNA processing
internal peptidyl-lysine acetylation
microtubule organizing center organization
internal protein amino acid acetylation
autophagosome organization
mitotic DNA integrity checkpoint signaling
G1/S transition of mitotic cell cycle
nucleic acid phosphodiester bond hydrolysis
centrosome cycle
positive regulation of DNA-templated transcription, elongation
autophagosome assembly
RNA phosphodiester bond hydrolysis
tRNA methylation
protein deacetylation
protein deacylation
macromolecule deacylation
mitotic DNA damage checkpoint signaling
DNA-templated DNA replication maintenance of fidelity
histone deacetylation
RNA methylation
histone H3 acetylation
regulation of DNA repair
double-strand break repair via homologous recombination
post-Golgi vesicle-mediated transport
recombinational repair
vacuole organization
positive regulation of transcription initiation by RNA polymerase II

For Upregulation

After subsetting I do have 1128 genes that are down regulated. Up regualted genes are mostly invovled in cancer activity. Most of the pathwyas are quite invovled in cancerous activty. There is a lot apoptosis and programmed cell death activities.

Upregulated_genes_filtered <- data.frame(
  term_name = Upregulated_Gprofiler_validated$result$term_name[Upregulated_Gprofiler_validated$result$term_size < 200 &                                                                         Upregulated_Gprofiler_validated$result$term_size > 1],
   term_id  = Upregulated_Gprofiler_validated$result$term_id[Upregulated_Gprofiler_validated$result$term_size < 200 & 
              Upregulated_Gprofiler_validated$result$term_size > 1],
  source  = Upregulated_Gprofiler_validated$result$source[Upregulated_Gprofiler_validated$result$term_size < 200 &
              Upregulated_Gprofiler_validated$result$term_size > 1]
)

length(Upregulated_genes_filtered$term_name)
## [1] 1128
knitr::kable(Upregulated_Gprofiler_validated$result$term_name[1:50], format = "html")
x
organonitrogen compound biosynthetic process
peptide metabolic process
translation
peptide biosynthetic process
organonitrogen compound metabolic process
amide metabolic process
amide biosynthetic process
cytoplasmic translation
cellular macromolecule metabolic process
cellular macromolecule biosynthetic process
protein metabolic process
metabolic process
cellular metabolic process
ATP synthesis coupled electron transport
mitochondrial ATP synthesis coupled electron transport
nitrogen compound metabolic process
respiratory electron transport chain
aerobic respiration
regulation of protein metabolic process
catabolic process
electron transport chain
aerobic electron transport chain
cellular respiration
oxidative phosphorylation
organic substance metabolic process
primary metabolic process
macromolecule catabolic process
protein localization
cellular macromolecule localization
macromolecule localization
organic substance catabolic process
protein catabolic process
establishment of protein localization
cellular catabolic process
programmed cell death
cell death
ribonucleoprotein complex biogenesis
cellular localization
organonitrogen compound catabolic process
apoptotic process
protein-containing complex organization
protein-containing complex assembly
intracellular transport
regulation of programmed cell death
generation of precursor metabolites and energy
energy derivation by oxidation of organic compounds
protein transport
cellular component biogenesis
ribose phosphate biosynthetic process
regulation of cell death

Part 18: Visualizations of the upregualted and the downregulated genes

gprofiler2::gostplot(Upregulated_Gprofiler_validated) %>% plotly::layout(title = "Upregulated genes ", font = list(size = 10))

Continuing

gprofiler2::gostplot(Downregulated_Gprofiler_validated) %>% plotly::layout(title = "Downregulated genes ", font = list(size = 10))

Part 19: GProfiler Analyses for the merged dataset

Complete_Gprofiler_Analyses <- gprofiler2::gost(query = merged_hits$Gene_symbol, 
                                  organism = "hsapiens", 
                                  exclude_iea = TRUE,
                                  correction_method = "fdr",
                                  sources = c("GO:BP", "REAC", "WP"))

Complete_Gprofiler_Analyses_filtered <- data.frame(term_name = Complete_Gprofiler_Analyses$result$term_name[Complete_Gprofiler_Analyses$result$term_size < 200 &
                                               Complete_Gprofiler_Analyses$result$term_size > 1],
  term_id = Complete_Gprofiler_Analyses$result$term_id[Complete_Gprofiler_Analyses$result$term_size < 200 &
                                           Complete_Gprofiler_Analyses$result$term_size > 1],
  source = Complete_Gprofiler_Analyses$result$source[Complete_Gprofiler_Analyses$result$term_size < 200 &
                                         Complete_Gprofiler_Analyses$result$term_size > 1])

length(Complete_Gprofiler_Analyses_filtered$term_name)
## [1] 1927
knitr::kable(Complete_Gprofiler_Analyses$result$term_name[1:100], format = "html")
x
cellular metabolic process
primary metabolic process
metabolic process
cellular macromolecule metabolic process
nitrogen compound metabolic process
organic substance metabolic process
macromolecule metabolic process
cellular nitrogen compound metabolic process
organonitrogen compound metabolic process
protein metabolic process
heterocycle metabolic process
nucleobase-containing compound metabolic process
cellular biosynthetic process
cellular response to stress
organic substance biosynthetic process
cellular aromatic compound metabolic process
biosynthetic process
organic cyclic compound metabolic process
intracellular transport
nucleic acid metabolic process
macromolecule modification
cellular catabolic process
catabolic process
establishment of localization in cell
macromolecule catabolic process
macromolecule biosynthetic process
protein modification process
cellular macromolecule catabolic process
cellular nitrogen compound biosynthetic process
RNA processing
cellular localization
gene expression
organonitrogen compound biosynthetic process
cellular macromolecule biosynthetic process
organelle organization
organic substance catabolic process
RNA metabolic process
cellular process
cellular response to DNA damage stimulus
protein localization to organelle
regulation of cellular metabolic process
translation
cellular macromolecule localization
regulation of nitrogen compound metabolic process
protein localization
mitotic cell cycle
peptide biosynthetic process
amide biosynthetic process
regulation of primary metabolic process
macromolecule localization
protein catabolic process
ribonucleoprotein complex biogenesis
mitotic cell cycle process
proteolysis involved in protein catabolic process
peptide metabolic process
modification-dependent macromolecule catabolic process
cellular component organization or biogenesis
amide metabolic process
modification-dependent protein catabolic process
intracellular protein transport
organonitrogen compound catabolic process
cell cycle
ubiquitin-dependent protein catabolic process
DNA metabolic process
regulation of metabolic process
regulation of cell cycle
mRNA metabolic process
positive regulation of nitrogen compound metabolic process
regulation of protein metabolic process
establishment of protein localization
regulation of macromolecule metabolic process
cellular component biogenesis
positive regulation of cellular metabolic process
positive regulation of nucleobase-containing compound metabolic process
ribosome biogenesis
protein transport
DNA repair
cell cycle process
cellular component organization
regulation of catabolic process
positive regulation of metabolic process
proteasomal protein catabolic process
ncRNA metabolic process
proteasome-mediated ubiquitin-dependent protein catabolic process
positive regulation of macromolecule biosynthetic process
positive regulation of cellular biosynthetic process
positive regulation of macromolecule metabolic process
protein modification by small protein conjugation or removal
regulation of cellular catabolic process
positive regulation of biosynthetic process
heterocycle biosynthetic process
regulation of macromolecule biosynthetic process
RNA splicing
process utilizing autophagic mechanism
autophagy
protein modification by small protein conjugation
regulation of cellular biosynthetic process
regulation of organelle organization
organic cyclic compound biosynthetic process
establishment of protein localization to organelle

Continuing

gprofiler2::gostplot(Complete_Gprofiler_Analyses) %>% plotly::layout(title = "Complete plot plot", font = list(size = 10))

Part 20:Interpretation

Do the over-representation results support conclusions or mechanism discussed in the original paper ?

Yes the over-representation results we find is correlated to the conclusion the authors make. They conclude that genes having higher GLI2 expression, more differentially expressed, are more possible to be EMT genes. Additionally, we found there is always higher expression on GLI2 treatments given this is a strong oncogene acting in basal-sub-type switching triggering EMT switching. As it is obvious from the visualization the upregualted genes are more clustered and are much more in the downregualted genes This supports the hypothesis of authors, concluding the EMT genes to be higher expression than that of the downregualted genes.

Can you find evidence, i.e. publications, to support some of the results that you see. How does this evidence support your results ?

The paper (Pasca di Magliano et al.,2006) has already indicated increase expression of GLI2 can lead to SHH gene reduction. This gene has been known to be an oncogene associated with basal-sub-type switching in PDAC. They activate GLI2 constantly which has the same expected outcome in our data set having GLI2 expressing vectors with constant Dox treatment. In conclusion these two paper findings correlate each other. They show that the EMT genes to be abundant in their biological functions and their pathway invovlement.

References

Adams, Christina R, Htet Htwe Htwe, Timothy Marsh, Aprilgate L Wang, Megan L Montoya, Lakshmipriya Subbaraj, Aaron D Tward, Nabeel Bardeesy, and Rushika M Perera. 2019. “Transcriptional Control of Subtype Switching Ensures Adaptation and Growth of Pancreatic Cancer.” Elife 8: e45313.
Allaire, JJ, Yihui Xie, Jonathan McPherson, Javier Luraschi, Kevin Ushey, Aron Atkins, Hadley Wickham, Joe Cheng, Winston Chang, and Richard Iannone. 2021. rmarkdown: Dynamic Documents for R. https://github.com/rstudio/rmarkdown.
Davis, S., and P. Meltzer. 2007. “GEOquery: A Bridge Between the Gene Expression Omnibus (GEO) and BioConductor.” Bioinformatics 14: 1846–47.
Durinck, S., P. Spellman, E. Birney, and W. Huber. 2009. “Mapping Identifiers for the Integration of Genomic Datasets with the r/Bioconductor Package biomaRt.” Nature Protocols 4: 1184–91.
Gu, Zuguang. 2022. “Complex Heatmap Visualization.” iMeta 1 (3): e43.
Gu, Zuguang, Lei Gu, Roland Eils, Matthias Schlesner, and Benedikt Brors. 2014. “Circlize Implements and Enhances Circular Visualization in r.” Bioinformatics 30 (19): 2811–12.
Magliano, Marina Pasca di, Shigeki Sekine, Alexandre Ermilov, Jenny Ferris, Andrzej A Dlugosz, and Matthias Hebrok. 2006. “Hedgehog/Ras Interactions Regulate Early Stages of Pancreatic Cancer.” Genes & Development 20 (22): 3161–73.
R Core Team. 2022. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Ritchie, Matthew E, Belinda Phipson, DI Wu, Yifang Hu, Charity W Law, Wei Shi, and Gordon K Smyth. 2015. “Limma Powers Differential Expression Analyses for RNA-Sequencing and Microarray Studies.” Nucleic Acids Research 43 (7): e47–47.
Robinson, M. D., McCarthy DJ, and G. K. Smyth. 2010. “edgeR: A Bioconductor Package for Differential Expression Analysis of Digital Gene Expression Data.” Bioinformatics 26: 139–40.